Data

Predictors: body mass interacting with habitat, effect of >10 kg, (maybe) diet.

Read in data

mammal_bmr <- read.csv("data/bmr.csv") 

Notes

  • logbmr is the base-10 logarithm of the BMR in ml \(O_2\) per hour
  • Log.of.BMR is the base-10 logarithm of the BMR in kcal/day
  • Log.of.Mass or logmb is the base-10 logarithm of the mass in kg

Cleaning

Rename some variables and create species name from genus and species. (There is already a variable in the dataset Species.Name but just to be safe and 100% sure of consistency with the genus/species values that may be used in random effects models, building it here.)

mammal_bmr <- mammal_bmr %>%
  rename(source = Source,
         order = order.corrected,
        # n.animals = Number.of.Animals,
         mass.kg = body.mass..kg.) %>%
  mutate(order = str_to_title(order),
         genus = str_to_title(genus),
         animal = paste(genus, species),
         above.10kg = ifelse(mass.kg >= 10, 'Larger', 'Smaller'))

Keep only variables we will be using. And “factor” “chr” variables.

mammal_bmr <- mammal_bmr %>%
  dplyr::select(order,
                genus,
                species, 
                common.name,
                mass.kg,
                log10.body.mass,
                log10.bmr,
                diet,
                habitat,
                source,
                animal,
                above.10kg
         ) %>%
  mutate(across(where(is.character), factor)) %>%
  arrange(order, genus, species)

Viz

A few quick graphs just to make sure the data are looking as we expect (error checking).

my_scatter_plot <- gf_point(log10.bmr ~ log10.body.mass | habitat,
         color = ~order,
         # size = ~parse_number(n_animals),
         data = mammal_bmr,
         alpha = 0.5) %>%
  gf_theme(legend.position = 'bottom',
           legend.title = element_text(size = 8),
           legend.text = element_text(size = 6)) %>%
  gf_theme(scale_color_viridis_d('Order')) %>%
  gf_labs(x = 'Log10(Mass (kg))',
          y = 'Log10(BMR (kcal/day))') 
my_scatter_plot

plotly::ggplotly(my_scatter_plot) %>%
  plotly::layout(legend = list(#orientation = 'h',
                               font = list(size = 6)))

Notes: If we wanted to use for web or publication, we would refine. We can edit what info is shown when you hover over a data point, change color scheme, legend, etc.

GLS

Will not account for phylogeny at all in the model structure. Predictors: body mass interacting with habitat, effect of >10 kg, (maybe) diet.

lm_model <- lm(log10.bmr ~ log10.body.mass * habitat + 
                 log10.body.mass*above.10kg + diet,
                 data = mammal_bmr)
tab_model(lm_model) 
  log10.bmr
Predictors Estimates CI p
(Intercept) 1.86 1.68 – 2.05 <0.001
log10.body.mass 0.77 0.68 – 0.86 <0.001
habitat [land] -0.20 -0.38 – -0.02 0.026
above.10kg [Smaller] 0.02 -0.12 – 0.16 0.774
diet [h] 0.09 0.05 – 0.12 <0.001
diet [o] 0.05 0.01 – 0.08 0.003
log10.body.mass * habitat
[land]
0.02 -0.07 – 0.12 0.646
log10.body.mass *
above.10kg [Smaller]
-0.12 -0.20 – -0.03 0.008
Observations 722
R2 / R2 adjusted 0.963 / 0.962
lm_anova_results <- car::Anova(lm_model)
pander(lm_anova_results)
Anova Table (Type II tests)
  Sum Sq Df F value Pr(>F)
log10.body.mass 214.9 1 8570 0
habitat 0.3411 1 13.61 0.0002427
above.10kg 0.6022 1 24.02 1.183e-06
diet 0.6796 2 13.55 1.672e-06
log10.body.mass:habitat 0.005289 1 0.2109 0.6462
log10.body.mass:above.10kg 0.178 1 7.098 0.007893
Residuals 17.9 714 NA NA

For more on formatting the fitted model table see: https://strengejacke.github.io/sjPlot/articles/tab_model_estimates.html

Model Assessment

lm_preds <- predict(lm_model, se.fit = TRUE)

mammal_bmr <- mammal_bmr  %>%
  mutate(lm_resids = resid(lm_model),
         lm_fitted = lm_preds$fit,
         lm_fit_lo = lm_preds$fit + 1.96*lm_preds$se.fit,
         lm_fit_hi = lm_preds$fit - 1.96*lm_preds$se.fit)
gf_point(lm_resids ~ lm_fitted, data = mammal_bmr)

acf(resid(lm_model))

gf_dhistogram(~lm_resids, data = mammal_bmr,
              bins = 20) %>%
  gf_fitdistr()

Model Predictions

Any predictors not shown in a plot are held constant at their mean or most common value.

gf_line(lm_fitted ~ log10.body.mass,
         color = ~habitat,
         data = mammal_bmr) %>%
  gf_ribbon(lm_fit_lo + lm_fit_hi ~ log10.body.mass,
            color = ~habitat, fill = ~habitat) %>%
  gf_facet_grid(~ diet) %>% 
  gf_theme(scale_color_manual(values = my_colors)) %>%
  gf_theme(scale_fill_manual(values = my_colors))

gf_point(log10.bmr ~ lm_fitted, data = mammal_bmr,
         alpha = 0.1) %>%
  gf_labs(y = 'Observed log10(BMR)',
          x = 'Model-predicted log10(BMR)',
          title = 'Linear Regresssion (no phylogeny)') %>%
  gf_abline(slope = 1, intercept = 0, color = 'black', linetype = 'dashed')

Can refine the plot of predictions later on.

Mixed-effects model

Will include nested random effects of order/family/genus/species, expecting similarity of observations based on a hierarchy of phylogenetic relatedness, but not in a specified/structured way; only groupings are used, with no sense of (for example) the fact that two orders are thought to be “further” apart than any other two.

# this code may generate warnings that are harmless
# https://stackoverflow.com/questions/67040472/warning-in-every-model-of-glmmtmb-givecsparse
re_model <- glmmTMB(log10.bmr ~ log10.body.mass * habitat +
                      log10.body.mass*above.10kg + diet +
                      (1 | order/genus/species),
                 data = mammal_bmr)

tab_model(re_model) 
  log10.bmr
Predictors Estimates CI p
(Intercept) 1.83 1.65 – 2.00 <0.001
log10.body.mass 0.74 0.66 – 0.83 <0.001
habitat [land] -0.25 -0.40 – -0.09 0.002
above.10kg [Smaller] 0.06 -0.06 – 0.19 0.332
diet [h] 0.05 0.01 – 0.09 0.014
diet [o] 0.02 -0.01 – 0.06 0.175
log10.body.mass * habitat
[land]
0.04 -0.05 – 0.13 0.367
log10.body.mass *
above.10kg [Smaller]
-0.11 -0.19 – -0.02 0.014
Random Effects
σ2 0.00
τ00 species:(genus:order) 0.01
τ00 genus:order 0.01
τ00 order 0.01
ICC 0.97
N species 626
N genus 415
N order 24
Observations 722
Marginal R2 / Conditional R2 0.950 / 0.999
re_anova_results <- car::Anova(re_model)
pander(re_anova_results)
Analysis of Deviance Table (Type II Wald chisquare tests)
  Chisq Df Pr(>Chisq)
log10.body.mass 5100 1 0
habitat 19.95 1 7.964e-06
above.10kg 8.212 1 0.004161
diet 6.304 2 0.04277
log10.body.mass:habitat 0.8136 1 0.367
log10.body.mass:above.10kg 5.983 1 0.01445

Model Assessment

re_ave_preds <- predict(re_model, 
                    se.fit = TRUE,
                    re.form = ~0)
re_ind_preds <- predict(re_model,
                        se.fit = TRUE,
                        re.form = NULL)
mammal_bmr <- mammal_bmr %>%
  mutate(re_resids = resid(re_model),
         re_ind_fitted = re_ind_preds$fit,
         re_ave_fitted = re_ave_preds$fit,
         re_ave_lo = re_ave_preds$fit - 1.96*re_ave_preds$se.fit,
         re_ave_hi = re_ave_preds$fit + 1.96*re_ave_preds$se.fit)

# save fitted model and data
saveRDS(mammal_bmr, 'fitted-models/mr-data.RDS')
saveRDS(re_model, 'fitted-models/mr-re-model.RDS')


gf_point(re_resids ~ re_ind_fitted, data = mammal_bmr)

acf(resid(re_model))

gf_dhistogram(~re_resids, data = mammal_bmr) %>%
  gf_fitdistr()

Predictions from Model

mammal_bmr <- mammal_bmr %>%
  mutate(pretty_diet = case_when(diet == 'c'~ 'Carnivore',
                                 diet == 'h' ~ 'Herbivore',
                                 diet == 'o' ~ 'Omnivore'),
         pretty_diet = factor(pretty_diet, levels = c('Herbivore', 'Omnivore', 'Carnivore')),
         Habitat = stringr::str_to_sentence(habitat))

gf_point(log10.bmr ~ log10.body.mass,
         color = ~Habitat, data = mammal_bmr,
         size = 0.5, alpha = 0.4) %>%
gf_line(re_ave_fitted ~ log10.body.mass,
         color = ~Habitat,
         data = mammal_bmr) %>%
  gf_ribbon(re_ave_lo + re_ave_hi ~ log10.body.mass,
            color = ~Habitat, fill = ~Habitat) %>%
  gf_facet_grid(~ pretty_diet) %>%
  gf_labs(x = 'Log10(Body Mass (kg))', y = 'Log10(BMR)\n(Dots are observed, lines predicted)') %>% 
  gf_theme(scale_color_manual(values = my_colors)) %>%
  gf_theme(scale_fill_manual(values = my_colors))

We wanted a graph of “aquatic vs terrestrial” – is this what we meant? (My notes don’t say any more about exactly what is to be shown…)

gf_point(log10.bmr ~ re_ind_fitted, data = mammal_bmr,
         alpha = 0.1) %>%
  gf_labs(y = 'Observed log10(BMR)',
          x = 'Model-predicted, Species-specific log10(BMR)',
          title = 'Mixed-effects Model (RE of Order/Genus/Species)') %>%
  gf_abline(slope = 1, intercept = 0, color = 'black', linetype = 'dashed')

The graph above is a bit “cheating” as we have a random effect of species, but there are only 1-2 measurements for most of the species (nearly guaranteeing that our estimates will be nearly perfect). What if we also check out the predictions accounting for the modeled effects of Order and Genus, but predicting for the “average” species in each Genus?

no_species <- mammal_bmr %>%
  mutate(species = NA)
re_genus_level_preds <- predict(re_model, 
                    se.fit = TRUE,
                    re.form = NULL,
                    newdata = no_species)

gf_point(log10.bmr ~ re_genus_level_preds$fit, data = mammal_bmr,
         alpha = 0.1) %>%
  gf_labs(y = 'Observed log10(BMR)',
          x = 'Model-predicted, Genus-specific log10(BMR)',
          title = 'Mixed-effects Model (RE of Order/Genus/Species)') %>%
  gf_abline(slope = 1, intercept = 0, color = 'black', linetype = 'dashed')

Alternate version: Mass Weightings

Weight observations by body mass, binned into log mass bins. The goal is to reduce the undue influence of many data points collected from very small animals.

nr <- nrow(mammal_bmr)
mammal_bmr <- mammal_bmr %>%
  mutate(log.mass.bin = cut_width(log10.body.mass,
                                  width = 1,
                                  boundary = log10(0.001))) %>%
  group_by(log.mass.bin) %>%
  mutate(log.mass.weights = 1 / n())

mammal_bmr <- mammal_bmr %>%
  # make weights sum to nrow(mammal_bmr)
  mutate(log.mass.weights = log.mass.weights / sum(mammal_bmr$log.mass.weights) * nrow(mammal_bmr)) %>%
  ungroup()
# this code may generate warnings that are harmless
# https://stackoverflow.com/questions/67040472/warning-in-every-model-of-glmmtmb-givecsparse
wt_re_model <- glmmTMB(log10.bmr ~ log10.body.mass * habitat +
                      log10.body.mass*above.10kg + diet +
                      (1 | order/genus/species),
                    weights = log.mass.weights,
                 data = mammal_bmr)

tab_model(wt_re_model) 
  log10.bmr
Predictors Estimates CI p
(Intercept) 1.82 1.64 – 1.99 <0.001
log10.body.mass 0.75 0.67 – 0.83 <0.001
habitat [land] -0.25 -0.40 – -0.09 0.002
above.10kg [Smaller] 0.06 -0.06 – 0.19 0.321
diet [h] 0.06 0.02 – 0.10 0.002
diet [o] 0.04 0.00 – 0.07 0.030
log10.body.mass * habitat
[land]
0.04 -0.05 – 0.12 0.394
log10.body.mass *
above.10kg [Smaller]
-0.11 -0.19 – -0.02 0.012
Random Effects
σ2 0.00
τ00 species:(genus:order) 0.01
τ00 genus:order 0.01
τ00 order 0.01
ICC 0.99
N species 626
N genus 415
N order 24
Observations 722
Marginal R2 / Conditional R2 0.951 / 1.000
wt_re_anova_results <- car::Anova(wt_re_model)
pander(wt_re_anova_results)
Analysis of Deviance Table (Type II Wald chisquare tests)
  Chisq Df Pr(>Chisq)
log10.body.mass 5091 1 0
habitat 21.38 1 3.775e-06
above.10kg 8.575 1 0.003408
diet 9.461 2 0.008821
log10.body.mass:habitat 0.7255 1 0.3943
log10.body.mass:above.10kg 6.298 1 0.01209

Alternate version: Massive only

Only include species with body mass of 10 kg or more.

big_mammal_bmr <- mammal_bmr %>%
  filter(log10.body.mass >= log10(10)) %>%
  mutate(across(where(is.factor), factor)) 
# this code may generate warnings that are harmless
# https://stackoverflow.com/questions/67040472/warning-in-every-model-of-glmmtmb-givecsparse
big_re_model <- glmmTMB(log10.bmr ~ log10.body.mass * habitat + diet +
                      (1 | order/genus/species),
                      data = big_mammal_bmr)

tab_model(big_re_model) 
  log10.bmr
Predictors Estimates CI p
(Intercept) 1.91 1.65 – 2.17 <0.001
log10.body.mass 0.68 0.58 – 0.78 <0.001
habitat [land] -0.39 -0.63 – -0.14 0.002
diet [h] 0.03 -0.11 – 0.16 0.720
diet [o] -0.04 -0.15 – 0.07 0.467
log10.body.mass * habitat
[land]
0.12 -0.01 – 0.25 0.080
Random Effects
σ2 0.01
τ00 species:(genus:order) 0.00
τ00 genus:order 0.00
τ00 order 0.04
ICC 0.87
N species 58
N genus 52
N order 12
Observations 60
Marginal R2 / Conditional R2 0.823 / 0.977
big_re_anova_results <- car::Anova(big_re_model)
pander(big_re_anova_results)
Analysis of Deviance Table (Type II Wald chisquare tests)
  Chisq Df Pr(>Chisq)
log10.body.mass 387.9 1 2.39e-86
habitat 14.08 1 0.0001755
diet 1.113 2 0.5732
log10.body.mass:habitat 3.07 1 0.07973

PGLS

Read in Tree Data

#Read in trees from Upham et al

tree_path <- paste("data/upham-trees/Completed_5911sp_topoCons_FBDasZhouEtAl")
tree_files <- list.files(tree_path)

all_trees <- list()

for (i in 1:length(tree_files)){
  all_trees[[i]] <- read.tree(paste0(tree_path, '/',
                                     tree_files[i]))
  if (i == 1){
    treeset <- all_trees[[i]]
  }else{
    treeset <- c(treeset, all_trees[[i]])
  }
}

all_tip_labels <- purrr::map(treeset, "tip.label")
all_tip_labels <- purrr::map(all_tip_labels,
                             function(x) 
                               stringr::str_replace_all(x, pattern = '_',
                                                      replacement = ' '))

# get list of species that are in ALL the trees
for (t in 1:length(all_tip_labels)){
  if (t == 1){
    tip_labs <- all_tip_labels[[t]]
  }else{
    tip_labs <- intersect(tip_labs, all_tip_labels[[t]])
  }
}

                            
# keep only the species in mammal_bmr that are in all the trees
# on 4/14 this removes 1 species.
pgls_data <- mammal_bmr %>%
  mutate(animal = as.character(animal)) %>%
  filter(animal %in% tip_labs) %>%
  droplevels()


taxonomy <- read_csv('data/upham-trees/taxonomy_mamPhy_5911species.csv')

Fit models, one model for every tree in our list.

pgls_models <- list()

for (t in c(1:length(treeset))){
  # make sure there is only one row of data per species (why sample and not average -- seems dubious??)
  pgls_rep_data <- pgls_data %>% 
    group_by(animal) %>%
    sample_n(1) %>%
    ungroup
  
  
  #Reduce the tree to only include those species in the data set
  refit_tree <- treeset[[t]]
  refit_tree$tip.label <- str_replace_all(refit_tree$tip.label, '_', ' ')
  refit_tree <- drop.tip(refit_tree, 
                         setdiff(refit_tree$tip.label, 
                                 unique(pgls_rep_data %>% pull(animal))))
  
  #Order the data set so that it is in the same order as the tip labels of the tree
  pgls_rep_data <- left_join(data.frame(tree.tip.label = refit_tree$tip.label),
                             pgls_rep_data,
                             by = c('tree.tip.label' = 'animal'),
                             keep = TRUE)
  
  # fit the model
 pgls_models[[t]] <- tryCatch({
    fittd <- gls(log10.bmr ~ log10.body.mass * habitat + 
                     log10.body.mass * above.10kg + diet ,
                   correlation = corPagel(value = 0.8, 
                                          phy = refit_tree, 
                                          fixed = FALSE, 
                                          form = ~animal),
                   data = pgls_rep_data)
    },
  error = function(cond){
    message(paste('PGLS fit failed for tree', t))
    return(NULL)
  }
  )
}
pgls_models <- pgls_models[-which(lapply(pgls_models, is.null) == T)]

Note: we tried to fit 101 PGLS models (each with a different tree); of these, model fitting failed for 101.

Combine the many PGLS model runs together into one summary combined model according to Rubin’s rule.

# as.mira takes the list of models and create an object to be used by the mice package
pgls_mira <- as.mira(pgls_models)  
# # pool summarise the models using Rubin's rule corrected for small samples
pooled_pgls   <- pool(pgls_mira)
pooled_pgls_summ <- summary(pooled_pgls, type = 'all', conf.int = TRUE)

pander(pooled_pgls_summ %>% select(term, estimate, std.error, `2.5 %`, `97.5 %`, lambda, fmi))
Table continues below
term estimate std.error 2.5 %
(Intercept) 1.694 0.1617 1.377
log10.body.mass 0.7698 0.03897 0.6933
habitatland -0.1655 0.07023 -0.3034
above.10kgSmaller 0.05458 0.05738 -0.05809
dieth 0.005226 0.02093 -0.03588
dieto -0.009785 0.01762 -0.04438
log10.body.mass:habitatland 0.01167 0.04312 -0.07299
log10.body.mass:above.10kgSmaller -0.07717 0.03831 -0.1524
97.5 % lambda fmi
2.012 0.001913 0.004746
0.8463 0.00118 0.004013
-0.02758 0.002059 0.004891
0.1672 0.002718 0.005551
0.04633 0.008937 0.01177
0.02481 0.007298 0.01013
0.09634 0.000957 0.00379
-0.001944 0.001083 0.003916
pgls_anova <- Anova(pgls_mira)
pander(pgls_anova)
Analysis of Deviance Table (Type II tests) (continued below)
  F num df den df missing info
log10.body.mass 4970 1 5931857 0.003943
habitat 10.75 1 10557358 0.002955
above.10kg 3.735 1 2196269 0.00648
diet 0.6599 2 5443751 0.005941
log10.body.mass:habitat 0.07329 1 100675163 0.000957
log10.body.mass:above.10kg 4.056 1 78576839 0.001083
  riv Pr(>F)
log10.body.mass 0.003958 0
habitat 0.002964 0.001043
above.10kg 0.006522 0.05328
diet 0.005976 0.5169
log10.body.mass:habitat 0.0009579 0.7866
log10.body.mass:above.10kg 0.001084 0.044

Alternative approach: using MuMIn to do model averaging. This will weight models according to their information criteria scores instead of weighting them all equally. But we see here the results are nearly identical.

pgls_avg <- model.avg(pgls_models, 
                      rank = function(x) 1)

This gives a model with the same coefficients (same model) that I think we can better make predictions from. The other way (with pool()) is better for getting the ANOVA results we wanted. Two ways of getting the same result, in different packages.

For the PGLS model, we also want to extract the estimate of \(\Lambda\), which tells us about how the phylogeny is affecting the correlation structure.

Previous approach was to take the mean of the estimates of \(\Lambda\) from all the individual fitted models.

lambda <- mean(unlist(purrr::map(pgls_models, 
                                 function(x) x$modelStruct$corStruct)))
lambda
## [1] 0.8891128

According to this simple method our estimate is \(\hat{\Lambda} =\) 0.889.

Model Assessment

It’s not really clear how to even approach this, since we don’t really any longer expect the residuals to be normal or independent. And based on the data we know there’s not a huge issue with linearity. So…ok?

Prediction Plots

There’s a problem here in that the “pooled” model object does NOT have any information stored anymore about the variance-covariance matrix, which is key for making predictions from the model. So for making predictions we need to use package MuMIn and its model.avg() function, just weighting equally all the models from all the trees. This results in a fitted single model object from which prediction is possible with predict(). From initial inspection the coefficient estimates of the averaged model are the same as the previous version averaged with the mice package.

pgls_preds <- predict(pgls_avg, 
                    se.fit = TRUE,
                    newdata = mammal_bmr)

mammal_bmr <- mammal_bmr %>%
  mutate(pgls_fitted = pgls_preds$fit,
         pgls_lo = pgls_preds$fit - 1.96*pgls_preds$se.fit,
         pgls_hi = pgls_preds$fit + 1.96*pgls_preds$se.fit
         )

gf_line(pgls_fitted ~ log10.body.mass,
         color = ~habitat,
         data = mammal_bmr) %>%
  gf_ribbon(pgls_lo + pgls_hi ~ log10.body.mass,
            color = ~habitat, fill = ~habitat) %>%
  gf_facet_grid(~ diet) %>% 
  gf_theme(scale_color_manual(values = my_colors)) %>%
  gf_theme(scale_fill_manual(values = my_colors)) 

gf_point(log10.bmr ~ pgls_fitted,
         data = mammal_bmr,
         alpha = 0.1) %>%
  gf_labs(y = 'Observed log10(BMR)',
          x = 'Model-predicted log10(BMR)',
          title = 'PGLS Model') %>%
  gf_abline(slope = 1, intercept = 0, color = 'black', linetype = 'dashed')

Notice that the CIs are wider here than in the original linear regression model. This is because of appropriate incorporation of the fact that the observations are not independent and there’s phylogenetic signal in the data.

As far as I know though, there is not a way to make predictions for specific species or taxa that are more accurate and account for the “way” in which the species/taxon tends to deviate from the overall average. Instead, our overall uncertainty is adjusted to account for the dependence that we know is in the data. But we can’t (I don’t think) decompose it into a random part and a phylogeny part like we can with the random effects; they are both mixed inextricably together.

Cross-validation

We could consider grouping by by body-mass ranges or phylogeny and then assess predictive accuracy as the measure of “success”

This remains to do.

What’s next?

Want to repeat a similar analysis with other datasets on other response variables of interest in a “physiological hierarchy”:

  • Stroke volume (sv)
  • Heart rate (hr)
  • Breathing Freq (bf)
  • Tidal volume (tv)

All these are other candidate response variables. We started with metabolism, and want to move on to these other metrics.

Main predictor of interest is habitat, controlling for body size and phylogeny: Are aquatic/marine animals different in some notable way?

Data sets for these other variables are smaller (about n = 50 species).